The 1,852,442-bp sequence of an M1 strain of Streptococcus
pyogenes, a Gram-positive pathogen, has been determined and
contains 1,752 predicted protein-encoding genes. Approximately
one-third of these genes have no identifiable function, with the
remainder falling into previously characterized categories of 
known microbial function. Consistent with the observation that S.
pyogenes is responsible for a wider variety of human disease than 
any other bacterial species, more than 40 putative virulence- 
associated genes have been identified. Additional genes have been 
identified that encode proteins likely associated with microbial 
"molecular mimicry" of host characteristics and involved in rheu- 
matic fever or acute glomerulonephritis. The complete or partial 
sequence of four different bacteriophage genomes is also present, 
with each containing genes for one or more previously undiscov- 
ered superantigen-like proteins. These prophage-associated genes 
encode at least six potential virulence factors, emphasizing the 
importance of bacteriophages in horizontal gene transfer and a 
possible mechanism for generating new strains with increased 
pathogenic potential. 

Streptococcus pyogenes, also known as group A streptococci 
(GAS), is a strict human pathogen, and no other known 
reservoir or species is affected by diseases unique to this 
organism. As a member of the low G+C% family of Gram- 
positive bacteria, this pathogen is responsible for a wide variety 
of disease, including pharyngitis (streptococcal sore throat), 
scarlet fever, impetigo, erysipelas, cellulitis, septicemia, toxic 
shock syndrome, necrotizing fasciitis (flesh-eating disease) and 
the sequelae, rheumatic fever and acute glomerulonephritis. 
Genetic variability is known to occur, as evidenced by the 
appearance of strains associated with outbreaks of infection such 
as necrotizing fasciitis, toxic shock syndrome, and rheumatic 
fever (1-3). The GAS are remarkable for the number of extra- 
cellular proteins produced, many of which have been demon- 
strated to increase the virulence of the organism. These proteins 
often trigger a severe nonspecific immunological response in the 
human host. S. pyogenes strains are grouped into two classes on
the basis of postinfectious sequelae associated with each strain:
class I is responsible for rheumatic fever and class II for
acute glomerulonephritis. Class I organisms, besides being
associated with poststreptococcal rheumatic fever, possess an
immunodeterminant contained in a surface-exposed conserved
region (the C repeat domain) of the M protein (class I M protein)
that is lacking in class II proteins (4). In this report, we present
the complete genomic sequence of a class I strain of S. pyogenes.

Methods 

The S. pyogenes genome sequence was determined by using the
whole-genome shotgun approach. Two genomic libraries were
constructed from randomly sheared genomic DNA (1- to 2-kb
insert and 3- to 5-kb insert), cloned into pUC18 (5), and end
sequenced with fluorescent terminators by using an ABI377
(Applied Biosystems) automated DNA sequencer. A third
library was constructed from Sau3AI partially digested genomic
DNA and cloned into the λ replacement vector, λBlueSTAR
(Novagen). End sequences from λ clones were used in
determining contig linkage for gap closure and final genome
linkage verification. All sequences were assembled by using the
phred/phrap/consed software package
(http://bozeman.mbt.washington.edu) (6, 7). Gap closure was
accomplished through primer walking on plasmid templates and
direct sequencing of combinatorial PCR products.

Initial ORF prediction was accomplished with GLIMMER 2.0
by using the default parameters (http://www.tigr.org) (8, 9).
ORFs showing significant overlap were visually examined and
removed as needed. Initial identification of ORFs was made
on the basis of BLASTP analysis against the nonredundant
protein database. Frameshift and point mutations were
corrected when appropriate, with ORFs containing
sequence-verified frameshift or point mutations designated as
putatively inactive genes. Further identification of ORFs was
performed through analysis with the Pfam (Rel. 4.4) (10),
COGs (11), and BLOCKS (Blocks Database Ver. 11.0) (12)
databases. A sequence E value of <10⁻⁴ was used as the cutoff
for all database searches. TopPred 2 (13) was used to identify
transmembrane domains, and SignalP
(http://www.cbs.dtu.dk/services/SignalP-2.0) (14) was used for
prediction of signal peptide regions. Functional assignment to
COGs categories was determined on the basis of the results of
the COGs database and agreement with results obtained from the
other database searches. Annotation was accomplished by using the
Genome Annotation Tool Kit from the Los Alamos National
Laboratory (Los Alamos, NM).
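
The E-value cutoff described above can be illustrated with a
minimal Python sketch. It assumes that search hits are available
as tab-separated text in the common 12-column BLAST tabular layout
(E value in the 11th column); the function name, the input layout,
and the sample rows are illustrative assumptions and are not taken
from the original annotation pipeline.

```python
E_VALUE_CUTOFF = 1e-4  # cutoff stated in the Methods text


def filter_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Keep hits whose E value is strictly below the cutoff.

    Each line is assumed to be one tab-separated hit with the
    E value in column 11 (index 10). Returns a list of
    (query_id, subject_id, e_value) tuples.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        e_value = float(fields[10])
        if e_value < cutoff:
            kept.append((fields[0], fields[1], e_value))
    return kept


# Hypothetical rows: identifiers and scores are invented for illustration.
sample = [
    "orf_a\tsubj_1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "orf_b\tsubj_2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
    "orf_c\tsubj_3\t40.0\t120\t72\t0\t1\t120\t1\t120\t1e-4\t60",
]
hits = filter_hits(sample)
print(hits)
```

Only the first hypothetical hit passes, because the comparison is
strict and a hit at exactly 10⁻⁴ is rejected; whether the original
analysis treated the boundary this way is not stated in the text.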

Detailed sequencing protocols and methodology are provided
on our web site (http://microgen.ouhsc.edu/) and at the
University of Oklahoma Advanced Center for Genome Technology
web site (http://www.genome.ou.edu/proto.html). The complete
S. pyogenes genome sequence has been deposited in the
Genome Sequence Database with accession no. AE004092.
Strain SF370 is available through the American Type Culture
Collection (ATCC 700294).

Sequence Analysis 

S. pyogenes strain SF370 was originally isolated from a patient
with a wound infection and its M1 serotype confirmed
serologically and by sequence analysis of the emm1 gene. The M1
serotype is among the most prevalent in terms of involvement in
severe invasive infections and as a class I organism may be
associated with rheumatic fever. This strain is also known to
contain an inducible bacteriophage containing streptococcal
erythrogenic toxin C (speC), also known as pyrogenic exotoxin
C, but no other previously identified mobile genetic elements.

Fig. 1. Circular representation of the S. pyogenes strain SF370 genome.
Outer circle, predicted coding regions transcribed on the forward (clockwise)
DNA strand. Second circle, predicted coding regions transcribed on the reverse
(counterclockwise) DNA strand. Third circle, stable RNA molecules. Fourth
circle, mobile genetic elements: burgundy, bacteriophage; blue,
transposons/IS elements; light cyan, transposons/IS elements (pseudogenes).
Fifth circle, known and putative virulence factors: purple, previously
identified ORFs; brown, ORFs identified as a result of genome sequence. The
lines in each concentric circle indicate the position of the represented
feature. Colors: dark gray, amino acid transport and metabolism; light gray,
carbohydrate transport and metabolism; green, cell division and chromosome
partitioning; olive green, cell envelope biogenesis, outer membrane; salmon,
cell motility and secretion; tan, coenzyme metabolism; violet, DNA
replication, recombination and repair; yellow, energy production and
conversion; light pink, function unknown; rose, general function prediction
only; light brown, inorganic ion transport and metabolism; light purple,
lipid metabolism; light blue, nucleotide transport and metabolism; orange,
posttranslational modification, protein turnover, chaperones; red, signal
transduction mechanisms; cyan, transcription; green, translation, ribosomal
structure and biogenesis; purple, virulence factors; magenta, stable RNA;
burgundy, bacteriophage; medium blue, pseudogenes; brown, newly identified
virulence factors; blue, transposons/IS elements.

The completed genome sequence was derived from over
42,000 sequence reads generated by the mass sequencing of a
whole-genome shotgun library cloned into pUC vectors (5),
followed by end sequencing of a large-insert λ library and direct
sequencing of PCR products to facilitate gap closure. The
average read length was 477 base pairs, and the final genome
coverage was 9.5-fold. The final contiguous proofread sequence
had a consed-calculated accuracy of greater than 99.98%, and
the deduced physical map is consistent with the previously
established physical and genetic map of SF370 by Suvorov and
Ferretti (15). After initial ORF prediction by using GLIMMER 2.0
(8, 9) under default settings, annotation was performed by using
the Genome Annotation Tool Kit from the Los Alamos National
Laboratory.
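
As a rough sanity check on the sequencing figures quoted above, the
raw read yield can be divided by the genome size. The sketch below
uses only numbers given in the text; the function name is
illustrative, and the read count is a lower bound because the text
says "over 42,000".

```python
GENOME_SIZE_BP = 1_852_442   # genome size given in the text
READ_COUNT = 42_000          # "over 42,000" reads, so a lower bound
MEAN_READ_LENGTH_BP = 477    # average read length given in the text


def raw_fold_coverage(reads, mean_read_length, genome_size):
    """Total sequenced bases divided by genome length."""
    return reads * mean_read_length / genome_size


raw_coverage = raw_fold_coverage(READ_COUNT, MEAN_READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"raw yield: about {raw_coverage:.1f}-fold")
```

This raw estimate comes out somewhat above the reported 9.5-fold
final coverage. The text does not explain the difference; one
plausible reason, offered here only as an assumption, is that
trimmed or unassembled read bases do not count toward the final
figure.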

The S. pyogenes SF370 genome is a circular chromosome with a
size of 1,852,442 base pairs and an average G+C content of
38.5%. The average G+C content of the protein-coding
sequences is 39.1%. Fig. 1 presents a circular map of the
chromosome with the direction of transcription emanating in both
directions from oriC. The starting point of base numbering is
located at the origin of bidirectional replication adjacent to the
dnaA gene in Box region C, similar to that described for Bacillus
subtilis (16). A linear map of the SF370 chromosome is presented
in Fig. 4 (which is published as supplemental data on the PNAS
web site, www.pnas.org), along with the putative functional
designation of each gene in Table 2, which is published as
supplemental data on the PNAS web site. The genes are
predominantly transcribed in the direction of DNA replication;
i.e., genes transcribed in the clockwise direction from oriC to the
replication terminus represent 83% of the genes, whereas 76%
of the genes are transcribed in the counterclockwise direction
from oriC. The location of the replication terminus appears to be
somewhat skewed from the expected position at 180° from oriC,
possibly because of the presence of two complete bacteriophage
genomes on one side. A replication termination protein
and ter site have not been identified at this time. However, a
putative dif-like termination sequence, identical to that found in
many bacteria, including Escherichia coli, is found starting at
base pair 929,320, roughly at the point opposite oriC (17). This
sequence, along with recombinases XerC and XerD (SPy1196
and SPy1092, respectively), most likely plays a role in the
resolution of newly replicated daughter chromosomes.

Table 1. Distribution of proteins among functional categories

Functional category                                             ORFs
Amino acid transport and metabolism                              101
Carbohydrate transport and metabolism                            109
Cell division and chromosome partitioning                         21
Cell envelope biogenesis, outer membrane                          57
Cell motility and secretion                                       21
Coenzyme metabolism                                               32
DNA replication, recombination and repair                         90
Energy production and conversion                                  58
Function unknown                                                 615
General function prediction only                                 172
Inorganic ion transport and metabolism                            48
Lipid metabolism                                                  40
Nucleotide transport and metabolism                               65
Posttranslational modification, protein turnover, chaperones      38
Signal transduction mechanisms                                    41
Transcription                                                     66
Translation, ribosomal structure and biogenesis                  132
Virulence factors                                                 46
TOTAL                                                          1,752

Stable RNA                                                        79
Putatively inactive genes                                         15
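
As a quick consistency check on Table 1, the per-category ORF
counts can be summed and compared with the stated total. The sketch
below simply transcribes the table; the variable names are
illustrative.

```python
# ORF counts transcribed from Table 1 (protein-coding categories only).
TABLE_1 = {
    "Amino acid transport and metabolism": 101,
    "Carbohydrate transport and metabolism": 109,
    "Cell division and chromosome partitioning": 21,
    "Cell envelope biogenesis, outer membrane": 57,
    "Cell motility and secretion": 21,
    "Coenzyme metabolism": 32,
    "DNA replication, recombination and repair": 90,
    "Energy production and conversion": 58,
    "Function unknown": 615,
    "General function prediction only": 172,
    "Inorganic ion transport and metabolism": 48,
    "Lipid metabolism": 40,
    "Nucleotide transport and metabolism": 65,
    "Posttranslational modification, protein turnover, chaperones": 38,
    "Signal transduction mechanisms": 41,
    "Transcription": 66,
    "Translation, ribosomal structure and biogenesis": 132,
    "Virulence factors": 46,
}
REPORTED_TOTAL = 1752  # TOTAL row of Table 1

total = sum(TABLE_1.values())
unknown_share = 100.0 * TABLE_1["Function unknown"] / total
print(total, f"{unknown_share:.1f}% function unknown")
```

The categories sum to the reported total, and the "Function unknown"
share is close to the one-third figure quoted in the abstract.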

Classification of Gene Products 

Of the total of 1,752 ORFs predicted in the genome, 1,282 (73%)
could be assigned a putative function or had an identifiable
homologue from another bacterial species. There are 79 stable
RNA genes, including 6 rRNA operons. Fully 10% of the ORFs
(176) are associated with prophage genomes harbored in the
SF370 chromosome. The greatest extent of similarity to proteins
from other species in the currently available databases was found
with B. subtilis, Lactococcus lactis, and various streptococci.

The overall distribution of protein-coding sequences accord- 
ing to functional groups is presented in Table 1. Metabolic 
pathways present include a complete glycolytic pathway, fatty 
acid synthesis, nucleotide synthesis and transport, and carbohy- 
drate transport and metabolism. Notable in its absence is a 
complete tricarboxylic acid cycle pathway and its accompanying 
electron transport system, consistent with its homofermentative 
metabolism and the facultative anaerobic environment in which
this organism resides. Additionally, only a few amino acids are 
synthesized, in accord with the fastidious growth requirements 
of the organism. This synthetic deficiency is offset by scavenging 
resources from the environment; S. pyogenes SF370 has six ABC 
transporters putatively identified as amino acid uptake systems, 
as well as two additional transporter systems that appear to 
mediate the uptake of dipeptides and oligopeptides. 

Regulation and Signaling 

The number of σ factors present in bacterial species varies
considerably; for example, 18 σ factors are present in B. subtilis,
whereas only 3 are present in the genome of Haemophilus
influenzae Rd (18, 19). S. pyogenes contains a major σ factor [σ70
(rpoD)] as well as an identifiable minor σ factor (homolog of σ32).
The σ32 (also known as σH) is one of the major factors necessary
for transcription of heat-induced proteins in E. coli (20), and the
homolog found in S. pyogenes may play a similar role when the
organism encounters elevated temperatures in the host. Another
putative σ factor is a homolog of the Streptococcus pneumoniae
proposed σ factor ComX, a transcriptional regulator of
competence-specific genes (21). A protein with sequence
similarity to the σ54 modulator protein (SPy1613) is present in the
genome; however, a σ54 homolog could not be conclusively
identified. It may be that this potential regulator interacts with
one of the identified σ factors or plays some other
undiscovered regulatory role. Overall, the number of σ factors
present in S. pyogenes (4 probable) is consistent with that found
in other bacterial pathogens with small genomes, which can range
from 1 to 4 (22).

As with other organisms, the presence of alternate
transcription signals allows the streptococcus to respond to
environmental changes (23, 24). S. pyogenes carries genes for a
number of stress-related proteins, including several proteases
involved in the stress response and most of the highly conserved
SOS regulon genes. Highly conserved genes responsible for
osmoregulation and genes involved in uptake and synthesis of
the osmoprotectants glycine-betaine, proline, and trehalose also
are present. In addition to stress pathways common to
eubacteria, lactic acid-producing bacteria must deal with
acidification of their local environment. The principal means of
protection against acid stress in S. pyogenes is most likely the
action of the proton-translocating F0F1-ATPase, a mechanism that
has been shown to efficiently protect Streptococcus mutans against
an acidified environment (25, 26). Additionally, the arginine
deiminase pathway is used by some species of lactococci,
streptococci, and lactobacilli to survive such a decrease in pH.
The genes responsible for this system have recently been examined
in Lactobacillus sakei (27), and an operon resembling this one is
found in S. pyogenes. The RelA/SpoT proteins are key components
of the bacterial stringent response. The genomic sequence
revealed the presence of a gene (SPy1981, relA) encoding a
bifunctional enzyme involved in the synthesis (Rel-like function)
and hydrolysis (SpoT-like function) of (p)ppGpp during amino
acid starvation (28). Thus, rel fulfills functions that reside
separately in the proteins encoded by the relA and spoT genes of
E. coli.

Among 13 identified two-component regulators, 6 can be
assigned to a specific function. Three are sensor-responder pairs
that appear to be associated with small peptide signaling systems;
one pair is associated with the salivaricin lantibiotic operon and
a second with the competence factor response system, ComD
and ComE (29). The third is the recently described
two-component regulator (csrS/csrR; covR/covS) that affects the
expression of streptolysin S as well as hyaluronic acid capsule
synthesis and pyrogenic exotoxin B expression (30-33). Another
two-component system (SPy2026 and SPy2027) may be involved
in bacterial virulence, being positioned near the major virulence
regulon controlled by Mga and located immediately upstream
from the immunogenic secreted protein gene, isp (34). An
additional two-component system (SPy0528 and SPy0529) is
homologous to YycF-YycG and to hk02-rr02, two-component
systems that are essential for growth in B. subtilis and S.
pneumoniae, respectively (35).

Thirty-six ABC transporters are found in strain SF370, and the 
roles of many are associated with conserved systems controlling 
the transport of iron and ferrichrome, phosphate, inorganic ions, 
sugars, dipeptides/oligopeptides, and amino acids. One trans- 
port system appears to be dedicated to the uptake of polyamines, 
offsetting the lack of de novo polyamine synthesis. Several of the 
transporters are related to multidrug-resistance/efflux systems 
and may play important roles in environmental stress responses. 
Additionally, the ABC transporter for choline uptake (OpuA 
and OpuB) is present. This system provides the substrate for the 
synthesis of glycine-betaine, an important osmoprotectant. Of 
these 36 transport systems, 8 apparently have alternate ATP- 
binding proteins, and 15 have no readily identifiable substrate 
specificity. 

Two nine-gene operons encoding information for the synthe- 
sis of bacteriocin-like peptide toxins have been identified. The 
first is salivaricin A, a bacteriocin originally described in S.
salivarius that is present in 90% of S. pyogenes strains (36).
The second is streptolysin S, a pore-forming hemolysin that has 
escaped identification for over 40 years (37). The genetic orga- 
nizations of both of these operons resemble that of the lantibiotic 
nisin produced by L. lactis (38) and the cytolysin of Enterococcus 
faecalis (39). 

Horizontal Gene Transfer, Bacteriophages, and Mobile Genetic 
Elements 

Horizontal gene transfer between bacterial species can occur by
several mechanisms, including competence-mediated
transformation and bacteriophage infection. Although many of the
related streptococci are naturally competent, transformation via
a competence pathway has never been described for S. pyogenes.
A number of genes present in strain SF370 specify proteins with
varying degrees of sequence similarity to competence-related
genes from the oral streptococci, S. pneumoniae, and B. subtilis.
In pneumococci, the genes for recA, cinA, and dinF are
transcribed as a single 5.7-kb transcript, where their coordinated
expression appears to be necessary for efficient incorporation of
donor DNA during transformation (40). The recA and cinA
genes of SF370 are also positioned together; however, the only
possible gene similar to dinF is found in a distant part of the
genome. Because the products of these genes also mediate
functions unrelated to transformation, such as SOS repair, any
role they play in the incorporation of foreign DNA is contingent
on the presence of other competence genes. Nevertheless, the
intriguing observation has been made that a binding site for the
ComX transcription factor for late-competence genes ("cin
box") is found in the promoter region of cinA (21, 41). As in S.
pneumoniae, two copies of comX are present in SF370, each
positioned next to duplicated ribosomal operons. Additional
copies of the cin box sequence (TACGAATA) have been
identified in the genome, some positioned in front of ORFs for
competence gene homologs such as the B. subtilis comGA. A
number of other genes similar to competence-related genes from
several Gram-positive bacteria have been identified, including
comG ORFs ABCD, comE ORFs CA, and comF ORFs CA (29).
The crucial late genes for expression of competence, comABC,
cannot be identified in the genome. Thus, whereas a significant
portion of the transformation mechanism appears to be present
in S. pyogenes, whether these genes have ever mediated such an
event in GAS cannot be determined.
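
The search for additional copies of the cin box (TACGAATA) can be
illustrated with a minimal Python sketch that scans both strands of
a DNA sequence for exact matches. This is only an assumption about
how such a scan might be done; the text does not describe the tool
actually used, and the toy sequence below is invented.

```python
CIN_BOX = "TACGAATA"  # cin box sequence quoted in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(seq, motif=CIN_BOX):
    """Return sorted (position, strand) pairs for exact motif matches.

    Positions are 0-based and always refer to the forward strand;
    "-" marks a match to the reverse complement of the motif.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping matches
    return sorted(hits)


# Hypothetical toy sequence: one forward and one reverse-strand copy.
toy = "GGTACGAATACCTATTCGTAGG"
motif_hits = find_motif(toy)
print(motif_hits)
```

An exact-match scan like this would miss degenerate sites, so a
real survey of the genome would probably need a tolerance for
mismatches or a position weight matrix.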

Fig. 2. %G+C profiles of phage genomes. A plot of the average %G+C
(100-base window) along the length of each complete phage genome is
shown with the residue numbers in the horizontal axis. The regions encoding
the known or putative virulence factors associated with each phage are
enclosed within the boxed regions; these regions all show a marked decrease
in average %G+C compared with the remainder of the genomes. Analysis was
done by using the Genetics Computer Group software package.

Bacteriophage and transposon genes account for ~10% of the
total genome, including the complete or partial sequence of four
bacteriophage genomes. One of the bacteriophages, identified as
phage 370.1, may be induced into the lytic cycle by mitomycin C 
treatment (not shown). This phage contains the speC gene and 
an adjacent gene (SPy0712; mf2) that has sequence similarity to 
the previously described streptococcal mitogenic factor 
(SPy2043), as well as to the nucleases EndA (competence-specific
nuclease) from S. pneumoniae and streptodornase from
S. pyogenes (42). A second phage genome (phage 370.2) appears 
to be complete, but attempts to induce the lytic cycle produce no 
phage particles. Analysis of the bacteriophage-related genes 
revealed a point mutation within the putative portal protein that 
results in a stop codon within the coding region that would 
eliminate the ability of this phage to package its chromosome 
into a prohead. Phage 370.2 carries two superantigen-like genes 
identified as speH/I. The third complete phage genome (370.3) 
also appears to be defective because it is also not inducible, 
although no obvious genetic defects have been identified within 
the predicted coding regions. Phage 370.3 also carries two genes 
with implications of horizontal transfer and virulence. These 
genes are located, as is the case for all known phage-associated 
toxin genes, at the end of the phage genome and are transcribed 
in an opposite direction from the bacteriophage genes. The first
gene is paralogous to the MF2 gene of phage 370.1, with a
similar, although reversed, order of sequence similarity with MF,
EndA, and streptodornase. The second gene, although possessing
several predicted transmembrane domains, appears to be
unique to S. pyogenes, unlike anything found in the
current databases. The fourth phage genome, phage 370.4, is
incomplete and has an extensive deletion that includes all 
identifiable structural and lysis genes. No virulence-associated 
gene can be identified in this phage genome. Of particular 
interest is the location of the virulence-associated genes near the 
integration site of each complete phage. Because the GC content 
of these genes is in the range of 26-30%, whereas the other 
adjacent phage genes are at or above the 38.5% average GC 
content of the overall chromosome (Fig. 2), it is likely that at 
some point in its evolution, these genes were acquired from an 
unrelated organism and transferred to S. pyogenes. The ubiqui- 
tous presence of phages in the GAS (43) assures the possibility 
of horizontal gene transfer of these virulence determinants, 
playing an important role in increasing the pathogenic potential 
of the organism as well as in its overall evolution. 
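
The %G+C comparison behind this argument can be sketched in a few
lines of Python. The original analysis used the Genetics Computer
Group package with a 100-base window; the version below is an
independent illustration, and its use of non-overlapping windows
and its toy sequence are assumptions.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA string (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_profile(seq, window=100, step=100):
    """%G+C for successive full-length windows along the sequence.

    Returns a list of (window_start, percent_gc) pairs; a trailing
    partial window is ignored.
    """
    return [
        (start, gc_percent(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical toy sequence: 100 bases at 40% G+C, then 100 at 20%.
toy = "GCATA" * 20 + "GATAT" * 20
profile = gc_profile(toy)
print(profile)
```

On a real prophage sequence, windows falling well below the 38.5%
chromosomal average would flag candidate regions of foreign origin,
in the spirit of the boxed regions in Fig. 2.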

Ten predicted transposons or insertion sequence (IS)
elements are dispersed evenly over the SF370 genome, along with
seven additional transposons that contain sequence mutations or
deletions resulting in gene inactivation. The only IS element that
appears to be directly associated with virulence factors is IS1562
(SPy2013), which is associated with the scpA and sic genes (44).
Interestingly, one of the inactive IS elements (SPy0858;
IS861-like element) is located adjacent to a putative gene fragment
(SPy0860), encoding a peptide with sequence similarity to the
C-terminal region of SpeC.

Virulence Factors 

Putative virulence-associated genes are abundant in the genome,
with many of the encoded proteins predicted to be localized to
the cell surface or secreted as extracellular products. Virtually all
previously identified and sequenced genes of class I organisms
were identified in the genome. These genes are located randomly
throughout the chromosome and are not grouped together as a
pathogenicity island, with the exception of the cluster of
virulence-associated factors in the region of emm. Extending from
mitogenic exotoxin Z (smeZ; SPy1998) through the mitogenic
factor (mf; SPy2043), this region includes many of the
best-studied virulence factors and their associated regulatory
elements such as the Mga regulon. Although it has been proposed
that this region may be a pathogenicity island (45), it does not
appear to have the organization of the well-studied virulence
regions of other bacteria (46).

Thirteen predicted surface proteins contain an LPXTG motif, 
such as the M protein, protein F, and C5a peptidase (see Table 
3, which is published as supplemental data on the PNAS web 
site). Proteins containing this motif are known to anchor their 
C-terminal ends to the cell surface (47). Although not identified 
as a virulence factor, the T protein is a protease-resistant 
cell-surface protein found in all GAS that is important in 
serological typing. The gene for the serotype 1 T protein in 
SF370 may vary significantly from the previously identified gene 
such that it cannot be identified by gapped BLAST search. All 
genes essential for capsule synthesis, another important GAS 
virulence factor, are present. 
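
A first-pass screen for the LPXTG motif discussed above can be
written as a simple pattern match on protein sequences. The sketch
below is an illustration only: the text does not state how the 13
surface proteins were identified, and the toy sequence is invented.

```python
import re

# L-P-any residue-T-G, the cell wall anchoring motif named in the text.
LPXTG = re.compile(r"LP.TG")


def find_lpxtg(protein):
    """Return (0-based position, matched motif) for each LPXTG occurrence."""
    return [(m.start(), m.group()) for m in LPXTG.finditer(protein.upper())]


# Hypothetical toy protein sequence containing one LPXTG motif.
toy = "MKKTAALPSTGETANPFFTAAALTVMATAGVAAVVKRKEEN"
lpxtg_hits = find_lpxtg(toy)
print(lpxtg_hits)
```

A pattern match alone is likely to overpredict. Anchored surface
proteins typically carry the motif near the C terminus, followed by
a hydrophobic stretch and a short charged tail, so a more careful
screen would check those features as well.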

At least six genes encoding new superantigen-like proteins also 
are found, many of which are associated with mobile genetic 
elements, making a total of 14 superantigen-like molecules identi- 
fied to date in the GAS. These known or putative streptococcal 
proteins all have at least one related protein identified from another 
Gram-positive bacterial species, suggesting that these genes may 
have been disseminated by horizontal transfer (Fig. 3). The prod- 
ucts of several of these genes have now been characterized and 
indeed shown to be among the class of superantigens (48). Genes 
encoding previously proposed virulence activities that had not been 
identified or cloned before the start of this sequencing project have 
now been located such as NADase, hyaluronidase, streptolysin S, 
amylase, phosphatase, and proteinase. Several known virulence 
genes have not yet been definitively identified, including those that 
encode the four DNase activities. The sequence similarity between 
MF3, MF2, and MF with EndA and streptodornase, both known 
nucleases, suggests that all might possess DNase activity and 
represent the genes for DNases. It was surprising to find the gene
for cAMP factor in the GAS genome, as this gene was thought to
be found only in the GBS or in Streptococcus uberis (49). Three
putative novel hemolysins are present in the genome, having
similarity to theoretical proteins from either S. mutans or B.
subtilis, as well as numerous other known and theoretical
hemolysins. A list of identified putative virulence factors is
presented in Table 4, which is published as supplemental data on
the PNAS web site.

In S. pneumoniae and several other streptococci, the spread of
resistance to β-lactam antibiotics in natural populations has
occurred when segments of penicillin-binding proteins (PBPs)
from sensitive strains were replaced by homologous blocks
originating from resistant strains, resulting in gene mosaics.
These transfers have most likely been mediated by natural
transformation with exogenous DNA and can cross species
boundaries (50). The two most important PBPs associated with
penicillin resistance in S. pneumoniae are encoded by pbp1A and
pbp2X. Phylogenetic comparisons of the S. pyogenes homologs of
these PBPs to the proteins from S. pneumoniae and the oral
streptococci show the relatedness of these proteins (see Fig. 5,
which is published as supplemental data on the PNAS web site);
however, BLOCKS analysis (50) showed that pbp1A and pbp2X from
S. pyogenes contain no lengthy regions of homology with the
genes from the other streptococci. Thus, the acquisition of
penicillin resistance by homologous recombination with genetic
material from a related species is unlikely. Additionally, because
there is no evidence that GAS are competent for transformation,
it is probable that penicillin resistance in GAS would have to
arise de novo.

Fig. 3. Phylogram of superantigen-like proteins identified in S. pyogenes
SF370. The protein alignment was generated by using CLUSTALX (by using the
BLOSUM matrix and a bootstrap trial of 1,000). The graphical representation of
the tree was generated by using TREEVIEW. Gene products encoded by SF370:
red; encoded by S. pyogenes but not present in SF370: green and encoded by
S. aureus. Scale bar represents the length of the branches. Bootstrap values are
displayed at each internal node. Note: SpeK is present in SF370 as a partial
product only; an intact copy of speK has not yet been identified in S. pyogenes.
Gene products encoded by S. pyogenes are in red with those proteins
specifically encoded by strain SF370 also enclosed in a box. The products
encoded by S. aureus are in blue. GenBank accession nos.: S. pyogenes
proteins: SSA (gb: AAA65928.1); SpeA (gb: AAC48868.1). S. aureus proteins:
SEA (prf: 1704203A); SEB (gb: AAA88550.1); SEC2 (gb: AAA26624.1); SED (gb:
AAB06195.1); SEE (gb: AAA26617.1); SEGv (dbj: BAA36693.1); SEH (gb:
AAA19777.1); SEI (gb: AAC26661.1); SEJ (gb: AAC78590.1); SEL (gb:
AAG29598.1); SEM (gb: AAG36952.1); TSST (gb: AAA26682.1). Supplementary
information is available on the world wide web sites for our laboratories at
the University of Oklahoma (http://www.genome.ou.edu/) and the University of
Oklahoma Health Sciences Center (http://www.microgen.ouhsc.edu/).

Several putative genes encoding proteins with internal repeats
of the motif sequence Gly-X-Y were identified in the genome.
These amino acid triplet repeats resemble the characteristic
repeating sequences found in collagen. Genes SPy1983 and
SPy1054 encode proteins that contain 50- and 38-aa triplet
repeats, respectively, and 2 bacteriophage hyaluronidase
proteins each contain 10 of the amino acid triplets. The GC
contents of genes SPy1983 and SPy1054 are 50.3 and 47.1%,
respectively, both considerably higher than the average of 38.5%
for the genome. The origin and function of these sequences are
unknown; however, availability of these proteins to the human
immune system during infection could possibly lead to antibodies
directed against collagen in connective tissue. The formation of
such autoantibodies could result in the polyarthritis generally
associated with rheumatic fever, one of the postinfection
sequelae of a GAS infection, similar to the onset of rheumatic
heart disease resulting from the crossreactivity of cardiac myosin
and the M protein (52). Further, at least one of these proteins
(SclA; SPy1983) has been shown to be expressed on the cell
surface and under the control of the Mga regulator (53),
suggesting a link to virulence.

Conclusions

The complete sequence of the S. pyogenes genome and the
resulting initial analysis that reveals the numerous encoded
virulence factors reflect how this organism has adapted to be an
obligate and versatile human pathogen. The continued analysis
of this genome should provide new insights not only into how
adaptations have shaped the overall genetic organization of the
GAS chromosome but also into the role regulatory elements play
in physiologic responses to environmental stress and in the
expression of virulence factors. Additionally, several approaches
to developing a GAS vaccine are currently under way (54), and
further genome analysis coupled with functional genomic studies
and gene distribution surveys should suggest new or alternate
candidates, especially from the gene products unique to GAS
and highly conserved among all strains. The eventual sequencing
of additional GAS strains, including a class II strain, should
provide answers to the classical question concerning the
difference between throat and skin strains (55, 56). The discovery
of additional new putative virulence factors should allow future
research to be directed toward answering important questions
relating to the physiology and pathogenesis of streptococcal
diseases, which will ultimately lead to improved prevention and
treatment of these diseases.